This script takes a deep dive into Landsat 5 labels for a more rigorous analysis of inconsistent band data and outliers in the filtered label dataset. Here we will determine if any more label data points should be removed from the training dataset and whether or not we can glean anything from the metadata in the outlier dataset to be able to pre-emptively toss out scenes when we go to apply the classification algorithm.
harmonize_version = "v2024-04-17"
outlier_version = "v2024-04-17"
LS5 <- read_rds(paste0("data/labels/harmonized_LS57_labels_", harmonize_version, ".RDS")) %>%
filter(mission == "LANDSAT_5")
Just look at the data to see consistent (or inconsistent) user-pulled data and our pull, here, our user data are in “BX” format and the re-pull is in “SR_BX” format. These are steps to assure data quality if the volunteer didn’t follow the directions explicitly.
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
There is some mis-match here, let’s look at those data, in this case, we’ll just use B7/SR_B7 as a reference to filter inconsistent labels
LS5_inconsistent <- LS5 %>%
filter((is.na(SR_B7) | SR_B7 != B7))
LS5_inconsistent %>%
group_by(class) %>%
summarise(n_labels = n()) %>%
kable()
| class | n_labels |
|---|---|
| cloud | 213 |
| darkNearShoreSediment | 1 |
| lightNearShoreSediment | 2 |
| openWater | 9 |
| other | 1 |
| shorelineContamination | 2 |
Most of these are cloud labels, where the pixel is saturated, and then masked in the re-pull value (resulting in an NA). Let’s drop those from this subset and then look more.
LS5_inconsistent <- LS5_inconsistent %>%
filter(!(class == "cloud" & is.na(SR_B7)))
This leaves 0.8% of the Landsat 5 labels as inconsistent. Let’s do a quick sanity check to make sure that we’ve dropped values that are inconsistent between pulls:
LS5_filtered <- LS5 %>%
filter(# filter data where SR_B7 has data and where the values match between the two
# pulls.
(!is.na(SR_B7) & SR_B7 == B7) |
# or where the user-specified class is cloud and the pixel was saturated
# providing no surface refelctance data
(class == "cloud" & is.na(SR_B7) ),
# or where any re-pulled band value is greater than 1, which isn't a valid value
if_all(LS57_ee,
~ . <= 1))
And plot:
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
There’s still a couple of mis-matched data points in here, seen in B4, so we’ll put those over into inconsistent, too.
LS5_inconsistent <- LS5_filtered %>%
filter(B4 != SR_B4) %>%
bind_rows(., LS5_inconsistent)
LS5_filtered <- LS5_filtered %>%
filter(B4 == SR_B4)
And now let’s look at the data by class:
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
We aren’t actually modeling “other” (not sufficient observations to classify) or “shorelineContamination” (we’ll use this later to block areas where there is likely shoreline contamination in the AOI), so let’s drop those categories and look at the data again.
LS5_for_class_analysis <- LS5_filtered %>%
filter(!(class %in% c("other", "shorelineContamination")))
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
Let’s also go back and check to see if there is any pattern to the inconsistent labels.
| vol_init | n_tot_labs | n_dates |
|---|---|---|
| HAD | 8 | 4 |
| BGS | 7 | 3 |
| LRCP | 5 | 2 |
| AMP | 1 | 1 |
There seem to be just a few inconsistencies here and across multiple dates. This could just be a processing difference (if there happened to be an update to a scene since users pulled these data or if these were on an overlapping portion of two scenes). I’m not concerned about any systemic errors here that might require modified data handling for a specific scene or contributor.
There are statistical outliers within this dataset and they may impact the interpretation of any statistical testing we do. Let’s see if we can narrow down when those outliers and/or glean anything from the outlier data that may be applicable to the the application of the algorithm. Outliers may be a systemic issue (as in the scene is an outlier), it could be a user issue (a user may have been a bad actor), or they just might be real. This section asks those questions. The “true outliers” that we dismiss from the dataset will also be used to help aid in interpretation/application of the algorithm across the Landsat stack, so it is important to make notes of any patterns we might see in the outlier dataset.
## [1] "Classes represented in outliers:"
## [1] "darkNearShoreSediment" "lightNearShoreSediment" "offShoreSediment"
## [4] "openWater"
Okay, 84 of extreme outliers (>1.5*IQR) out of 1626 total labels - and they are all from non-cloud groups.
Are there any contributors that show up more than others in the outliers dataset?
| vol_init | n_tot | n_out | percent_outlier |
|---|---|---|---|
| BGS | 592 | 59 | 10.0 |
| AMP | 58 | 5 | 8.6 |
| LRCP | 115 | 8 | 7.0 |
| KRW | 32 | 2 | 6.2 |
| HAD | 352 | 10 | 2.8 |
| FYC | 63 | NA | NA |
| LAE | 15 | NA | NA |
These aren’t terrible. At/below 10%, nothing egregious.
How many of these outliers are in specific scenes?
| date | vol_init | n_out | n_tot | percent_outlier |
|---|---|---|---|---|
| 1988-11-23 | BGS | 28 | 67 | 41.8 |
| 1993-06-30 | HAD | 9 | 24 | 37.5 |
| 1998-10-02 | BGS | 15 | 98 | 15.3 |
| 2010-09-01 | AMP | 5 | 58 | 8.6 |
| 1998-08-31 | LRCP | 8 | 115 | 7.0 |
| 2000-08-04 | BGS | 5 | 74 | 6.8 |
| 1984-07-07 | KRW | 2 | 32 | 6.2 |
| 2006-07-04 | BGS | 3 | 58 | 5.2 |
| 2009-04-23 | BGS | 2 | 53 | 3.8 |
| 2004-08-15 | BGS | 2 | 55 | 3.6 |
| 2006-06-02 | BGS | 3 | 105 | 2.9 |
| 2007-05-04 | HAD | 1 | 57 | 1.8 |
| 2010-04-10 | BGS | 1 | 79 | 1.3 |
There are two scenes here that have very high outliers (> 20% of total labels) - perhaps there is something about the AC in these particular scenes? or the general scene quality? Let’s look at the scene-level metadata:
| date | vol_init | CLOUD_COVER | IMAGE_QUALITY | DATA_SOURCE_AIR_TEMPERATURE | DATA_SOURCE_ELEVATION | DATA_SOURCE_OZONE | DATA_SOURCE_PRESSURE | DATA_SOURCE_REANALYSIS | DATA_SOURCE_WATER_VAPOR |
|---|---|---|---|---|---|---|---|---|---|
| 1988-11-23 | BGS | 3, 3, 23, 43, 43 | 9, 9, 9, 9, 9 | NCEP | GLS2000 | TOMS | NCEP | MERRA-2 | NCEP |
| 1993-06-30 | HAD | 63.0, 63.0, 70.5, 78.0, 78.0 | 7, 7, 8, 9, 9 | NCEP | GLS2000 | TOMS | NCEP | MERRA-2 | NCEP |
This isn’t terribly helpful, there is a range of
IMAGE_QUALITY, there are ranges of
CLOUD_COVER, otherwise metadata is basically the same (but
to broad to make a distinction for exclusion) While I feel fine dropping
these two scenes from the training dataset, we need a systematic way to
make sure we don’t apply the algo to similar scenes. If there isn’t
anything common in the metadata, we would need to mask pixels by band
ranges that are present in the training set (as those are the ones we
have confidence in) or flag pixels that we aren’t confident in after the
fact (because they are outside of the range used for the training
dataset).
How many bands are outliers when the data are aggregated back to label? If there is a large portion of outliers amongst the RGB bands (how users labeled data), there is probably a systemic problem. If the outliers are in singular bands, especially those that are not in the visible spectrum, we can dismiss the individual observations, and probably assert that the scene as a whole is okay to use in training. First pass, if there are 3 or more bands deemed outliers for a particular label, let’s look at the bands that are outliers:
| date | class | vol_init | user_label_id | n_bands_out | bands_out |
|---|---|---|---|---|---|
| 1988-11-23 | openWater | BGS | 4085 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4086 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4087 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4088 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4089 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4091 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4094 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4095 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4096 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4097 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4098 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4100 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4101 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4102 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4103 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 4080 | 3 | SR_B1; SR_B2; SR_B3 |
| 1988-11-23 | openWater | BGS | 4081 | 3 | SR_B1; SR_B2; SR_B3 |
| 1988-11-23 | openWater | BGS | 4082 | 3 | SR_B1; SR_B2; SR_B3 |
| 1988-11-23 | openWater | BGS | 4084 | 3 | SR_B1; SR_B2; SR_B4 |
| 1988-11-23 | openWater | BGS | 4090 | 3 | SR_B1; SR_B2; SR_B3 |
| 1988-11-23 | openWater | BGS | 4099 | 3 | SR_B1; SR_B2; SR_B4 |
| 1998-08-31 | darkNearShoreSediment | LRCP | 3036 | 3 | SR_B4; SR_B5; SR_B7 |
| 1998-08-31 | darkNearShoreSediment | LRCP | 3127 | 3 | SR_B4; SR_B5; SR_B7 |
| 1998-10-02 | offShoreSediment | BGS | 3832 | 3 | SR_B4; SR_B5; SR_B7 |
| 1998-10-02 | offShoreSediment | BGS | 3833 | 3 | SR_B4; SR_B5; SR_B7 |
| 1998-10-02 | offShoreSediment | BGS | 3834 | 3 | SR_B4; SR_B5; SR_B7 |
| 2006-06-02 | lightNearShoreSediment | BGS | 3612 | 3 | SR_B4; SR_B5; SR_B7 |
| 2006-07-04 | lightNearShoreSediment | BGS | 4271 | 3 | SR_B4; SR_B5; SR_B7 |
| 2006-07-04 | lightNearShoreSediment | BGS | 4288 | 3 | SR_B4; SR_B5; SR_B7 |
| 2007-05-04 | darkNearShoreSediment | HAD | 4071 | 3 | SR_B4; SR_B5; SR_B7 |
Let’s group by image date and volunteer and tally up the number of labels where at least 3 bands where outliers:
| date | vol_init | n_labels |
|---|---|---|
| 1988-11-23 | BGS | 21 |
| 1998-10-02 | BGS | 3 |
| 1998-08-31 | LRCP | 2 |
| 2006-07-04 | BGS | 2 |
| 2006-06-02 | BGS | 1 |
| 2007-05-04 | HAD | 1 |
At the very least, there are issues in the 1988-11-23 scene… but is that user error? Atmospheric correction?
This looks super hazy, so that may be the issue with this particular scene. Unfortunately, there is no way to know this from the scene metadata. The SR_ATMOS_OPACITY value is low (this is supposed to indicate the skies are clear), the cirrus confidence is low (but it’s also low for a majority of the labels in this dataset). Reanalysis for this, or dense dark vegetation estimation for aerosol stand-in in LEDAPS may be the culprit - or it could be something to do with surrounding bright snow-covered land that impacts the Rrs values of pixels. This is probably a reason to avoid categorizing LS5 images during winter months, and that bright snow along the shoreline may cause issues with classification.
Do any of the labels have QA pixel indications of cloud or cloud shadow? The first pass here is for all data that don’t have a label of “cloud” (not just outliers). Let’s see if the medium certainty classification in the QA band is useful here:
LS5_for_class_analysis %>%
mutate(QA = case_when(str_sub(QA_PIXEL_binary, 1, 2) %in% c(10, 11) ~ "cirrus",
str_sub(QA_PIXEL_binary, 3, 4) %in% c(10, 11) ~ "snow/ice",
str_sub(QA_PIXEL_binary, 5, 6) %in% c(10, 11) ~ "cloud shadow",
str_sub(QA_PIXEL_binary, 7, 8) %in% c(10, 11) ~ "cloud",
TRUE ~ "clear")) %>%
group_by(QA) %>%
filter(class != "cloud") %>%
summarize(n_tot = n()) %>%
kable()
| QA | n_tot |
|---|---|
| cirrus | 11 |
| clear | 43 |
| cloud | 1103 |
| cloud shadow | 8 |
| snow/ice | 62 |
Well, that’s not helpful, considering that most of the label dataset has at least a medium confidence cloud QA flag as the pixel QA. Now with only high confidence clouds:
LS5_for_class_analysis %>%
mutate(QA = case_when(str_sub(QA_PIXEL_binary, 1, 2) == 11 ~ "cirrus",
str_sub(QA_PIXEL_binary, 3, 4) == 11 ~ "snow/ice",
str_sub(QA_PIXEL_binary, 5, 6) == 11 ~ "cloud shadow",
str_sub(QA_PIXEL_binary, 7, 8) == 11 ~ "cloud",
TRUE ~ "clear")) %>%
group_by(QA) %>%
filter(class != "cloud") %>%
summarize(n_tot = n()) %>%
kable()
| QA | n_tot |
|---|---|
| cirrus | 11 |
| clear | 1165 |
| cloud | 2 |
| cloud shadow | 4 |
| snow/ice | 45 |
Okay, if we use only “high confidence”, this seems a little better. Let’s look at the labels where pixels were classified as snow/ice:
LS5_snow_ice <- LS5_for_class_analysis %>%
filter(str_sub(QA_PIXEL_binary, 3, 4) == 11,
class != "cloud") %>%
group_by(date, vol_init) %>%
summarise(n_snow_ice = n()) %>%
arrange(-n_snow_ice)
LS5_tot <- LS5_for_class_analysis %>%
group_by(date, vol_init) %>%
filter(class != "cloud") %>%
summarise(n_tot_labels = n())
left_join(LS5_snow_ice, LS5_tot) %>%
mutate(percent_snow_ice = round(n_snow_ice/n_tot_labels*100, 1)) %>%
arrange(-percent_snow_ice) %>%
kable()
| date | vol_init | n_snow_ice | n_tot_labels | percent_snow_ice |
|---|---|---|---|---|
| 1987-11-05 | HAD | 21 | 41 | 51.2 |
| 1990-06-06 | HAD | 7 | 25 | 28.0 |
| 1998-10-02 | BGS | 9 | 98 | 9.2 |
| 1993-06-30 | HAD | 2 | 24 | 8.3 |
| 2004-08-15 | BGS | 2 | 55 | 3.6 |
| 2011-10-06 | HAD | 2 | 111 | 1.8 |
| 2007-05-04 | HAD | 1 | 57 | 1.8 |
| 1988-11-23 | BGS | 1 | 67 | 1.5 |
We don’t want to be using data that’s flagged as snow/ice nor are we going to classify pixels that are snow/ice, so let’s see what classes are represented in the pixels with snow/ice QA designations. Let’s look at that image on 1987-11-05 to see if there is visible ice:
I don’t see instances of ice/snow here… I suspect that when the QA bit misclassifies as snow/ice, there may be a systemic issue (maybe from the bright clouds?) To be conservative, I think we should drop this scene, especially since more than 50% of the labels have a snow/ice flag associated with them. I don’t believe that the QA pixel is used to assign atmospheric correction, so I think only dropping the egregious one here is acceptable.
Let’s also look at the cirrus flagged pixels:
LS5_for_class_analysis %>%
filter(str_sub(QA_PIXEL_binary, 1, 2) == 11,
class != "cloud") %>%
group_by(date, vol_init) %>%
summarise(n_cirrus = n()) %>%
arrange(-n_cirrus) %>%
left_join(., LS5_tot) %>%
mutate(percent_cirrus = round(n_cirrus/n_tot_labels*100, 1)) %>%
arrange(-percent_cirrus) %>%
kable()
| date | vol_init | n_cirrus | n_tot_labels | percent_cirrus |
|---|---|---|---|---|
| 1993-06-30 | HAD | 8 | 24 | 33.3 |
| 1998-10-02 | BGS | 2 | 98 | 2.0 |
| 2009-04-23 | BGS | 1 | 53 | 1.9 |
Let’s look at the 1993-06-30 image:
Yeah, that looks pretty hazy to me! What does the scene metadata say?
LS5_for_class_analysis %>%
filter(date == "1993-06-30") %>%
select(date, vol_init, CLOUD_COVER:DATA_SOURCE_WATER_VAPOR) %>%
distinct() %>%
kable()
| date | vol_init | CLOUD_COVER | IMAGE_QUALITY | DATA_SOURCE_AIR_TEMPERATURE | DATA_SOURCE_ELEVATION | DATA_SOURCE_OZONE | DATA_SOURCE_PRESSURE | DATA_SOURCE_REANALYSIS | DATA_SOURCE_WATER_VAPOR |
|---|---|---|---|---|---|---|---|---|---|
| 1993-06-30 | HAD | 63.0, 63.0, 70.5, 78.0, 78.0 | 7, 7, 8, 9, 9 | NCEP | GLS2000 | TOMS | NCEP | MERRA-2 | NCEP |
Ooo, yeah, there are some clouds there and some lower quality images. We should probably drop any labels that have a cirrus QA flag that aren’t cloud labels.
outliers %>%
mutate(QA = case_when(str_sub(QA_PIXEL_binary, 1, 2) == 11 ~ "cirrus",
str_sub(QA_PIXEL_binary, 3, 4) == 11 ~ "snow/ice",
str_sub(QA_PIXEL_binary, 5, 6) == 11 ~ "cloud shadow",
str_sub(QA_PIXEL_binary, 7, 8) == 11 ~ "cloud",
TRUE ~ "clear")) %>%
group_by(QA) %>%
filter(class != "cloud") %>%
summarize(n_out_tot = n()) %>%
kable()
| QA | n_out_tot |
|---|---|
| cirrus | 6 |
| clear | 72 |
| cloud | 2 |
| cloud shadow | 2 |
| snow/ice | 2 |
Let’s look into those cirrus QA bits:
outliers %>%
filter(str_sub(QA_PIXEL_binary, 1, 2) == 11,
class != "cloud") %>%
group_by(date, vol_init) %>%
summarise(n_out_cirrus = n()) %>%
arrange(-n_out_cirrus) %>%
kable()
| date | vol_init | n_out_cirrus |
|---|---|---|
| 1993-06-30 | HAD | 6 |
And there is the 1993-06-30 image from earlier. Again, just toss the labels that are flagged as cirrus from the QA bit.
How many of these outliers have near-pixel clouds (as measured by ST_CDIST)?
There are 10 labels (11.9% of oultiers) that aren’t “cloud” in the outlier dataset that have a cloud distance <500m and 108 labels (6.6%) in the whole dataset that have a cloud distance <500m. Since this is about the same portion of labels (or they are not severely disproportionate), I don’t think this is terribly helpful.
How many of the outliers have high cloud cover, as reported by the scene-level metadata? Note, we don’t have the direct scene cloud cover associated with individual labels, rather a list of the scene level cloud cover values associated with the AOI.
The outlier dataset contains 9 (10.7%) where the max cloud cover was > 75% and 9 (10.7%) where the mean cloud cover was > 50%. The filtered dataset contains 24 (1.5%) where max was >75% and 47 (2.9%) where the mean cloud cover was > 50%. While there is a greater instance of higher CLOUD_COVER in the outliers, it’s not a large enough portion of the outlier dataset to say that we should just toss scenes of either case above.
Pixels can also be saturated in one or more bands, we need to make sure that the QA_RADSAT for all labels (including clouds) are set to zero.
LS5_for_class_analysis %>%
mutate(radsat = if_else(QA_RADSAT == 0,
"n",
"y")) %>%
group_by(radsat) %>%
summarize(n_tot = n()) %>%
kable()
| radsat | n_tot |
|---|---|
| n | 1626 |
Great! No bands are saturated!
For the purposes of training data, I think we can throw out the data from the 1988-11-23 and 1987-11-05 scenes, any non-cloud classes that are flagged as snow/ice or cirrus. All other outliers should be retained, in my opinion.
LS5_training_labels <- LS5_for_class_analysis %>%
filter(!(date %in% c("1988-11-23", "1987-11-05")),
!(str_sub(QA_PIXEL_binary, 1, 2) == 11 & class != "cloud"),
!(str_sub(QA_PIXEL_binary, 3, 4) == 11 & class != "cloud"))
We do want to have an idea of how different the classes are, in regards to band data. While there are a bunch of interactions that we could get into here, for the sake of this analysis, we are going to analyze the class differences by band.
Kruskal-Wallis assumptions:
ANOVA assumptions:
We can’t entirely assert sample independence and we know that variance and distribution is different for “cloud” labels, but those data also are visibly different from the other classes.
In order to systematically test for differences between classes and be able to intepret the data, we will need to know some things about our data:
With this workflow, most classes are statistically different - below are the cases where the pairwise comparison were not deemed statistically significant:
## # A tibble: 7 × 9
## band group1 group2 n1 n2 statistic p p.adj p.adj.signif
## <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 SR_B2 darkNearShoreS… offSh… 135 287 0.217 0.828 1 ns
## 2 SR_B4 darkNearShoreS… light… 135 260 2.45 0.0144 0.144 ns
## 3 SR_B5 darkNearShoreS… light… 135 260 2.69 0.00724 0.0724 ns
## 4 SR_B5 darkNearShoreS… offSh… 135 287 -1.62 0.104 1 ns
## 5 SR_B7 darkNearShoreS… light… 135 260 2.18 0.0294 0.294 ns
## 6 SR_B7 darkNearShoreS… offSh… 135 287 -1.55 0.122 1 ns
## 7 SR_B7 offShoreSedime… openW… 287 403 -1.91 0.0556 0.556 ns
There is some consistency here: “darkNearShoreSediment” is often not different from other sediment types by band. It is entirely possible that band interactions overpower these non-significant differences.
DNSS: dark near shore sediment, LNSS: light near shore sediment, OSS: offshore sediment
There are definitely some varying patterns here, let’s zoom in on the sediment classes.
DNSS: dark near shore sediment, LNSS: light near shore sediment, OSS: offshore sediment
Okay, this seems sensical as to why there is no significant difference in many of the band-level “darkNearShoreSediment” labels - there’s a lot of overlap in ranges. Looking at these scatter plot matrices, I do think there are likely different enough patterns when considering multiple bands that ML should be able to pick up on subtle differences.
Things to note for Landsat 5:
when neighboring land is covered in snow, it’s possible that the Rrs data will be brighter than usual (but our n = 1 here, so this is more of a warning)
pixels with a QA bit of cirrus or snow/ice should be masked when applying the algorithm
a combination of higher cloud cover and lower image quality may be a reason to exclude a scene (again, n = 1)
write_rds(LS5_training_labels, paste0("data/labels/LS5_labels_for_tvt_", outlier_version, ".RDS"))